Text Indexing and Searching in Sublinear Time

Authors

  • J. Ian Munro
  • Gonzalo Navarro
  • Yakov Nekrich
Abstract

We introduce the first index that can be built in $o(n)$ time for a text of length $n$ and also queried in $o(m)$ time for a pattern of length $m$. On a constant-size alphabet, for example, our index uses $O(n \log^{1/2+\varepsilon} n)$ bits, is built in $O(n / \log^{1/2-\varepsilon} n)$ deterministic time, and finds the $occ$ pattern occurrences in time $O(m/\log n + \sqrt{\log n}\,\log\log n + occ)$, where $\varepsilon > 0$ is an arbitrarily small constant. By comparison, the most recent classical text index uses $O(n \log n)$ bits, is built in $O(n)$ time, and searches in time $O(m/\log n + \log\log n + occ)$. We build on a novel text sampling based on difference covers, whose properties allow us to compute longest common prefixes in constant time. We also extend our results to the secondary memory model, where we give the first construction in $o(Sort(n))$ time of a data structure with suffix array functionality, which can search for patterns in almost optimal time, with an additive penalty of $O(\sqrt{\log_{M/B} n}\,\log\log n)$, where $M$ is the size of the available main memory and $B$ is the disk block size.

Author affiliations:

  • Cheriton School of Computer Science, University of Waterloo.
  • CeBiB (Center of Biotechnology and Bioengineering), Department of Computer Science, University of Chile. Funded with Basal Funds FB0001, Conicyt, Chile.
  • Cheriton School of Computer Science, University of Waterloo.
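The abstract names difference covers as the basis of its text sampling. As a rough illustration of the general idea only (this is the textbook notion of a difference cover, not the paper's novel sampling scheme), the Python sketch below checks that a set is a difference cover modulo `v` and finds, for two text positions, a small shift that moves both onto sampled positions. The function names and the example cover `{1, 2, 4}` modulo 7 are assumptions chosen for illustration.

```python
def is_difference_cover(cover, v):
    """Return True if every residue mod v is a difference of two cover elements."""
    diffs = {(a - b) % v for a in cover for b in cover}
    return diffs == set(range(v))


def sampled_positions(n, cover, v):
    """Text positions in [0, n) whose residue mod v belongs to the cover."""
    members = set(cover)
    return [p for p in range(n) if p % v in members]


def common_shift(i, j, cover, v):
    """Smallest delta in [0, v) such that i + delta and j + delta are both sampled.

    Such a delta always exists when `cover` is a difference cover mod v:
    pick a, b in the cover with a - b = i - j (mod v) and shift i onto a.
    """
    members = set(cover)
    for delta in range(v):
        if (i + delta) % v in members and (j + delta) % v in members:
            return delta
    raise ValueError("cover is not a difference cover modulo v")


# Hypothetical example: {1, 2, 4} is a difference cover modulo 7.
COVER, V = [1, 2, 4], 7
```

The property shown by `common_shift` is what makes such samplings useful for longest-common-prefix queries in general: comparing two arbitrary suffixes reduces to comparing fewer than `v` symbols directly and then consulting information stored only for sampled positions.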

Similar resources

Opportunistic Data Structures with Applications

There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also...

Analysis of algorithms and data structures for text indexing

Large amounts of textual data like document collections, DNA sequence data, or the Internet call for fast look-up methods that avoid searching the whole corpus. This is often accomplished using tree-based data structures for text indexing such as tries, PATRICIA trees, or suffix trees. We present and analyze improved algorithms and index data structures for exact and error-tolerant search. Affi...

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching∗

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal...

A Comparing between the impacts of text based indexing and folksonomy on ranking of images search via Google search engine

Background and Aim: The purpose of this study was to compare the impact of text based indexing and folksonomy in image retrieval via Google search engine. Methods: This study used experimental method. The sample is 30 images extracted from the book “Gray anatomy”. The research was carried out in 4 stages; in the first stage, images were uploaded to an “Instagram” account so the images are tagge...

Order-Preserving Matching with Filtration

The problem of order-preserving matching has gained attention lately. The text and the pattern consist of numbers. The task is to find all substrings in the text which have the same relative order as the pattern. The problem has applications in analysis of time series like stock market or weather data. Solutions based on the KMP and BMH algorithms have been presented earlier. We present a new s...

The State of Information Retrieval in the Namayeh and Nama Databases and an Assessment of the Effectiveness of Controlled Vocabulary in Indexing These Two Databases

Purpose: This study was carried out to determine the level of precision, recall, and searching time for “Nama” and “Namayeh” databases, as well as to find out which of the indexing tools (thesaurus and Dewey decimal classification) helps us more in improvement of information retrieval. Methodology: This study is an analytical survey in which the necessary data was collected by direct observati...

Journal:
  • CoRR

Volume: abs/1712.07431

Publication date: 2017